In this work, we focus on generating graph representations of noisy, instructional videos for video understanding. We propose a self-supervised, interpretable approach that does not require any annotations for graph representations, which would be expensive and time-consuming to obtain. We attempt to overcome "black box" learning limitations by presenting Semantic Video Graph, or SVGraph, a multi-modal approach that utilizes narrations for semantic interpretability of the learned graphs. SVGraph 1) relies on the agreement between multiple modalities to learn a unified graph structure with the help of cross-modal attention, and 2) assigns semantic interpretation with the help of semantic assignment, which captures semantics from the video narration. We perform experiments on multiple datasets and demonstrate the interpretability of SVGraph in semantic graph learning.
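A minimal sketch, assuming a simple scaled dot-product formulation, of the kind of cross-modal attention SVGraph relies on: video-derived node embeddings attend over narration-token embeddings so that each node is grounded in the narration. All shapes, names, and the scoring function are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(node_queries, text_tokens):
    """node_queries: (N, d) visual node embeddings; text_tokens: (T, d) narration tokens."""
    scores = node_queries @ text_tokens.t() / node_queries.shape[-1] ** 0.5  # (N, T)
    attn = F.softmax(scores, dim=-1)       # each node attends over narration tokens
    grounded = attn @ text_tokens          # (N, d) narration-informed node features
    return grounded, attn

nodes = torch.randn(8, 256)     # e.g. 8 candidate graph nodes
tokens = torch.randn(20, 256)   # e.g. 20 narration tokens
grounded_nodes, attention = cross_modal_attention(nodes, tokens)
print(grounded_nodes.shape, attention.shape)  # torch.Size([8, 256]) torch.Size([8, 20])
```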
Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations focused on video and language. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting findings: 1) the studied models are more robust when text is perturbed than when video is perturbed, and 3) using two-branch encoders is generally more robust than cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.
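To make the evaluation concrete, below is a hedged sketch of how degradation under a perturbation can be quantified: run retrieval on clean and perturbed inputs and compare the scores. The metric name and formula are illustrative and not necessarily the exact definition used for MSRVTT-P and YouCook2-P.

```python
def relative_robustness(clean_score: float, perturbed_score: float) -> float:
    """1.0 means no degradation; lower values mean the perturbation hurts more."""
    if clean_score == 0:
        return 0.0
    return 1.0 - (clean_score - perturbed_score) / clean_score

# e.g. Recall@5 of 0.62 on clean videos dropping to 0.48 under a visual perturbation
print(relative_robustness(0.62, 0.48))  # ~0.774
```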
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches to self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
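As an example of one of the four categories, the sketch below shows a minimal InfoNCE-style contrastive objective over two augmented clips of the same video; the batch size, embedding dimension, and temperature are illustrative choices and are not tied to any specific surveyed method.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """anchors, positives: (B, d) embeddings of two augmented clips of the same video."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(anchors.shape[0])         # matching clips sit on the diagonal
    return F.cross_entropy(logits, targets)

a = torch.randn(16, 128)
p = torch.randn(16, 128)
print(info_nce(a, p).item())
```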
We focus on the problem of novel-view human action synthesis. Given an action video, the goal is to generate the same action from an unseen viewpoint. Naturally, novel-view video synthesis is more challenging than image synthesis: it requires synthesizing a sequence of realistic frames with temporal coherence. Moreover, transferring different actions to a novel target view requires awareness of both the action category and the simultaneous change of viewpoint. To address these challenges, we propose a novel framework named Pose-guided Action Separable Generative Adversarial Network (PAS-GAN), which utilizes pose to alleviate the difficulty of this task. First, we propose a recurrent pose-transformation module which transforms actions from the source view to the target view and generates novel-view pose sequences in 2D coordinate space. Second, the well-transformed pose sequence enables us to separate the action and background in the target view. We employ a novel local-global spatial transformation module to effectively generate sequential video features in the target view from these action and background features. Finally, the generated video features are used to synthesize the human action with the aid of a 3D decoder. Furthermore, to focus on the dynamic action in the video, we propose a novel multi-scale action-separable loss, which further improves video quality. We conduct extensive experiments on two large-scale multi-view human action datasets, NTU-RGBD and PKU-MMD, demonstrating the effectiveness of PAS-GAN, which outperforms existing approaches.
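The sketch below only mirrors the three-stage structure described above (pose transformation, action/background decoupling, 3D decoding) with placeholder callables; every interface, shape, and stand-in module is an assumption made for illustration, and none of it is the authors' implementation.

```python
import torch

def pas_gan_forward(source_video, source_poses, target_view,
                    pose_transformer, feature_generator, decoder_3d):
    # 1) recurrent pose transformation: source-view poses -> target-view pose sequence
    target_poses = pose_transformer(source_poses, target_view)
    # 2) decouple action and background features in the target view and generate
    #    sequential video features (local-global spatial transformation)
    action_feats, background_feats = feature_generator(source_video, target_poses)
    # 3) synthesize the novel-view action video with a 3D decoder
    return decoder_3d(action_feats, background_feats)

# Placeholder modules so the structural sketch runs end to end
pose_tf = lambda poses, view: poses                        # identity stand-in
feat_gen = lambda video, poses: (video.mean(1), video.mean(1))
dec3d = lambda action, background: torch.stack([action, background], dim=1)

video = torch.randn(2, 16, 64)   # (batch, frames, feature dim) -- illustrative
poses = torch.randn(2, 16, 34)   # e.g. 17 2D joints per frame
out = pas_gan_forward(video, poses, target_view=1,
                      pose_transformer=pose_tf, feature_generator=feat_gen, decoder_3d=dec3d)
print(out.shape)  # torch.Size([2, 2, 64])
```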
We have seen great progress in video action recognition in recent years. There are several models based on convolutional neural networks (CNNs) and some recent transformer-based approaches which provide state-of-the-art performance on existing benchmark datasets. However, large-scale robustness has not been studied for these models, which is a critical aspect for real-world applications. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We mainly focus on robustness against distribution shifts caused by real-world perturbations rather than adversarial perturbations. We propose four different benchmark datasets, HMDB-51P, UCF-101P, Kinetics-400P, and SSv2P, and study the robustness of six different state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models against most perturbations, 2) pre-training helps transformer-based models to be more robust to different perturbations than CNN-based models, and 3) all of the studied models are robust to temporal perturbations on the Kinetics dataset but not on SSv2. This suggests that temporal information is more important for action label prediction on the SSv2 dataset than on the Kinetics dataset. We hope this study will serve as a benchmark for future research in robust video action recognition. More details about the project are available at https://rose-ar.github.io/.
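As one concrete example of the temporal perturbations probed here, the sketch below shuffles the frame order of a clip; whether a model's prediction survives this indicates how strongly it relies on temporal structure (the SSv2-versus-Kinetics observation above). The tensor layout is an illustrative assumption.

```python
import torch

def shuffle_frames(clip: torch.Tensor, generator=None) -> torch.Tensor:
    """clip: (T, C, H, W). Returns the clip with its frames in a random order."""
    perm = torch.randperm(clip.shape[0], generator=generator)
    return clip[perm]

clip = torch.randn(16, 3, 112, 112)   # 16 RGB frames
perturbed = shuffle_frames(clip)
print(perturbed.shape)  # torch.Size([16, 3, 112, 112])
```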
In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach which effectively utilizes the unlabeled data. Video action detection requires both action class prediction and spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency and 2) gradient smoothness. Both of these exploit the temporal continuity of actions in videos and are found to be effective in utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we also show the effectiveness of the proposed approach for video object segmentation on YouTube-VOS, which demonstrates its generalization capability. Compared with recent fully supervised methods, the proposed approach achieves competitive performance using merely 20% of the annotations on UCF101-24. On UCF101-24, it improves over the supervised baseline by +8.9% and +11% at 0.5 f-mAP and v-mAP respectively.
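A hedged sketch of the two regularizers named above, applied to per-frame localization maps predicted for an unlabeled clip; the exact formulations in the paper may differ, and the tensor layout is an assumption that only illustrates the idea of exploiting temporal continuity.

```python
import torch

def temporal_coherency_loss(loc_maps: torch.Tensor) -> torch.Tensor:
    """loc_maps: (T, H, W) predicted action masks. Penalize abrupt frame-to-frame change."""
    return (loc_maps[1:] - loc_maps[:-1]).abs().mean()

def gradient_smoothness_loss(loc_maps: torch.Tensor) -> torch.Tensor:
    """Penalize large second-order temporal variation (non-smooth motion of the mask)."""
    grad = loc_maps[1:] - loc_maps[:-1]
    return (grad[1:] - grad[:-1]).abs().mean()

maps = torch.sigmoid(torch.randn(8, 56, 56))   # 8 frames of predicted masks
loss = temporal_coherency_loss(maps) + gradient_smoothness_loss(maps)
print(loss.item())
```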
Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain both optimality and computational efficiency, and it has recently been used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI-modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential of designs that use a GI approach to allocate participants: they can improve participant benefits, increase efficiency, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.
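The sketch below shows only where an allocation index plugs into an adaptive two-armed experiment with exponentially distributed rewards; the index() used here is a crude UCB-style stand-in for illustration, not the Gittins Index modification proposed in this work.

```python
import math
import random

random.seed(0)
true_rates = [1.0, 2.0]                 # exponential rate parameters of the two arms
counts, sums = [0, 0], [0.0, 0.0]       # per-arm sufficient statistics

def index(arm: int, t: int) -> float:
    if counts[arm] == 0:
        return float("inf")             # force an initial pull of each arm
    mean_reward = sums[arm] / counts[arm]
    bonus = math.sqrt(2 * math.log(t + 1) / counts[arm])   # exploration bonus (stand-in)
    return mean_reward + bonus

for t in range(200):
    arm = max(range(2), key=lambda a: index(a, t))   # allocate the next participant
    reward = random.expovariate(true_rates[arm])     # exponentially distributed reward
    counts[arm] += 1
    sums[arm] += reward

print(counts, [s / c for s, c in zip(sums, counts)])
```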
Quadruped robots are currently used in industrial robotics as mechanical aids to automate several routine tasks. However, the use of such a robot in a domestic setting is still largely a research problem. This paper discusses the understanding and virtual simulation of such a robot, capable of detecting and understanding human emotions, generating its own gait, and responding via sounds and on-screen expressions. To this end, we use a combination of reinforcement learning and software engineering concepts to simulate a quadruped robot that can understand emotions, navigate through various terrains, detect sound sources, and respond to emotions using audio-visual feedback. This paper aims to establish a framework for simulating a quadruped robot that is emotionally intelligent and can primarily respond to audio-visual stimuli with motor or audio responses. Emotion detection from speech was not as performant as ERANNs or Zeta Policy learning, but still achieved an accuracy of 63.5%. The video emotion detection system produced results almost on par with the state of the art, with an accuracy of 99.66%. Due to its on-policy learning process, the PPO algorithm learned extremely rapidly, allowing the simulated dog to demonstrate a remarkably seamless gait across the different cadences and variations. This enabled the quadruped robot to respond to generated stimuli, allowing us to conclude that it functions as predicted and satisfies the aim of this work.
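For the gait-learning component, a minimal PPO setup is sketched below using stable-baselines3; since the custom quadruped simulation environment is not reproduced here, a standard Gymnasium task stands in, and all hyperparameters are library defaults rather than the ones used in this work.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")             # stand-in for the simulated quadruped environment
model = PPO("MlpPolicy", env, verbose=0)  # on-policy learner, as referenced above
model.learn(total_timesteps=10_000)       # short run for illustration only

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
print(action)
```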
Real-world robotic grasping can be done robustly if a complete 3D Point Cloud Data (PCD) of an object is available. However, in practice, PCDs are often incomplete when objects are viewed from few and sparse viewpoints before the grasping action, leading to the generation of wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from the partial PCD to produce reliable grasp poses. Our proposed PCD completion network is a Transformer-based encoder-decoder network with an Offset-Attention layer. Our network is inherently invariant to object pose and point permutation, and generates PCDs that are geometrically consistent and properly completed. Experiments on a wide range of partial PCDs show that 3DSGrasp outperforms the best state-of-the-art method on PCD completion tasks and largely improves the grasping success rate in real-world scenarios. The code and dataset will be made available upon acceptance.
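A minimal sketch of an offset-attention block in the spirit of the layer named above (as popularized by the Point Cloud Transformer); the normalization, attention scaling, and dimensions here are assumptions, and the exact block used in 3DSGrasp may differ.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU())

    def forward(self, x):                       # x: (B, N, dim) per-point features
        attn = torch.softmax(
            self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        attended = attn @ self.v(x)             # (B, N, dim)
        offset = x - attended                   # the "offset" that gives the layer its name
        return x + self.lbr(offset)             # residual refinement of the point features

points = torch.randn(2, 1024, 128)              # 2 partial clouds, 1024 points, 128-d features
print(OffsetAttention(128)(points).shape)       # torch.Size([2, 1024, 128])
```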
When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task "features" -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use, and in turn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
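A hedged sketch of how similarity queries can drive representation learning: an encoder maps trajectory features to embeddings, and pairs the user labels as similar are pulled together while pairs labeled different are pushed apart. The encoder, margin, loss form, and data format are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def similarity_query_loss(traj_a, traj_b, same: torch.Tensor, margin: float = 1.0):
    """traj_a, traj_b: (B, 32) trajectory features; same: (B,) 1.0 if the user says similar."""
    za, zb = encoder(traj_a), encoder(traj_b)
    dist = F.pairwise_distance(za, zb)
    # contrastive form: shrink distance for similar pairs, enforce a margin otherwise
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()

a, b = torch.randn(8, 32), torch.randn(8, 32)
labels = torch.randint(0, 2, (8,)).float()     # stand-in for user answers to similarity queries
print(similarity_query_loss(a, b, labels).item())
```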